Thorough testing of safety-critical autonomous systems, such as self-driving cars, autonomous robots, and drones, is essential for detecting potential failures before deployment. One crucial testing stage is model-in-the-loop testing, where the system model is evaluated by executing various scenarios in a simulator. However, the search space of possible parameters defining these test scenarios is vast, and simulating all combinations is computationally infeasible. To address this challenge, we introduce AmbieGen, a search-based test case generation framework for autonomous systems. AmbieGen uses evolutionary search to identify the most critical scenarios for a given system, and has a modular architecture that allows for the addition of new systems under test, algorithms, and search operators. Currently, AmbieGen supports test case generation for autonomous robots and autonomous car lane keeping assist systems. In this paper, we provide a high-level overview of the framework's architecture and demonstrate its practical use cases.
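To make the evolutionary-search idea concrete, here is a minimal, hypothetical sketch of a search loop over scenario parameters; the scenario encoding, operators, and the placeholder fitness function standing in for a simulator run are illustrative assumptions, not AmbieGen's actual API:

```python
import random

# Hypothetical scenario encoding: a list of road-segment angles for a lane-keeping test.
# The fitness function is a stand-in for executing the scenario in a simulator and
# measuring how far the vehicle deviates from the lane center (higher = more critical).
def fitness(scenario):
    return abs(sum(scenario)) / len(scenario)  # placeholder for a simulator run

def mutate(scenario, sigma=10.0):
    child = list(scenario)
    child[random.randrange(len(child))] += random.gauss(0, sigma)
    return child

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=20, scenario_len=8, generations=50):
    population = [[random.uniform(-60, 60) for _ in range(scenario_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # keep the most critical scenarios
        survivors = population[: pop_size // 2]
        offspring = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

print(evolve())
```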
As machine learning (ML) systems get adopted in more critical areas, it has become increasingly crucial to address the bias that could occur in these systems. Several fairness pre-processing algorithms are available to alleviate implicit biases during model training. These algorithms employ different concepts of fairness, often leading to conflicting strategies with consequential trade-offs between fairness and accuracy. In this work, we evaluate three popular fairness pre-processing algorithms and investigate the potential for combining all algorithms into a more robust pre-processing ensemble. We report on lessons learned that can help practitioners better select fairness algorithms for their models.
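Purely as an illustration of what a pre-processing ensemble could look like (the abstract does not specify the ensemble design, and the member algorithms below are simplified stand-ins), one option is to average the per-sample weights proposed by each fairness pre-processing algorithm:

```python
import numpy as np

def reweighing_weights(y, protected):
    """Illustrative reweighing: up-weight under-represented (group, label) pairs."""
    w = np.ones(len(y), dtype=float)
    for g in np.unique(protected):
        for label in np.unique(y):
            mask = (protected == g) & (y == label)
            if mask.any():
                expected = (protected == g).mean() * (y == label).mean()
                w[mask] = expected / mask.mean()
    return w

def ensemble_weights(y, protected, algorithms):
    """Naive ensemble: average the sample weights proposed by each member algorithm."""
    return np.mean([algo(y, protected) for algo in algorithms], axis=0)

# Usage with two hypothetical members: reweighing and uniform (no-op) weights.
y = np.array([0, 1, 1, 0, 1, 0])
protected = np.array([0, 0, 1, 1, 1, 0])
members = [reweighing_weights, lambda y, p: np.ones(len(y))]
print(ensemble_weights(y, protected, members))
```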
Recent advances in deep learning (DL) have led to the release of several DL software libraries, such as PyTorch, Caffe, and TensorFlow, to assist machine learning (ML) practitioners in developing and deploying state-of-the-art deep neural networks (DNNs); however, practitioners are still not able to properly cope with limitations of the DL libraries, such as testing or data processing. In this paper, we present a qualitative and quantitative analysis of the most frequent DL library combinations and of the distribution of DL library dependencies across the ML workflow, and we formulate a set of recommendations for (i) hardware builders, towards more optimized accelerators, and (ii) library builders, towards more refined future releases. Our study is based on 1,484 open-source DL projects with 46,110 contributors, selected based on their reputation. First, we found an increasing trend in the usage of deep learning libraries. Second, we highlight several usage patterns of deep learning libraries. In addition, we identify dependencies between DL libraries and the most frequent combinations, where we discover that PyTorch with Scikit-learn and Keras with TensorFlow are the most frequent combinations, appearing in 18% and 14% of the projects, respectively. Developers use two or three DL libraries in the same project and tend to combine multiple DL libraries both within the same function and within the same file. Developers also show patterns in the way they use the various deep learning libraries, preferring simple functions with fewer arguments and straightforward goals. Finally, we present the implications of our findings for researchers, library maintainers, and hardware vendors.
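A simplified sketch of how such library combinations could be mined from Python imports; the library list and project path are illustrative assumptions, not the study's actual mining pipeline:

```python
import ast
import itertools
from collections import Counter
from pathlib import Path

DL_LIBRARIES = {"torch", "tensorflow", "keras", "sklearn", "caffe"}

def libraries_in_file(path):
    """Return the set of tracked libraries imported in one Python file."""
    try:
        tree = ast.parse(Path(path).read_text(encoding="utf-8", errors="ignore"))
    except SyntaxError:
        return set()
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & DL_LIBRARIES

def combination_counts(project_root):
    """Count pairwise library combinations across a project's Python files."""
    pair_counts = Counter()
    for path in Path(project_root).rglob("*.py"):
        libs = sorted(libraries_in_file(path))
        pair_counts.update(itertools.combinations(libs, 2))
    return pair_counts

# Example with a hypothetical project path:
# print(combination_counts("./some_dl_project"))
```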
Malware is becoming increasingly complex and is spreading across networks, targeting different infrastructures and personal end devices to collect, modify, and destroy victim information. Malware behaviors are polymorphic, metamorphic, and persistent; they are able to hide in order to bypass detectors and adapt to new environments, and even leverage machine learning techniques to better damage their targets. This makes malware difficult to analyze and detect with traditional endpoint detection and response or intrusion detection and prevention systems. To defend against malware, recent work has proposed different techniques based on signatures and machine learning. In this paper, we propose to use an algebraic topological approach, topological data analysis (TDA), to efficiently analyze and detect complex malware patterns. We then compare different TDA techniques (i.e., persistent homology, ToMATo, TDA Mapper) and existing techniques (i.e., PCA, UMAP, t-SNE) using different classifiers, including Random Forest, Decision Tree, XGBoost, and LightGBM. We also provide recommendations for deploying the best-identified models for malware detection at scale. Results show that TDA Mapper (combined with PCA) is better than PCA alone for clustering and for identifying hidden relationships between malware clusters. Persistence diagrams are better at identifying overlapping malware clusters with low execution time compared to UMAP and t-SNE. For malware detection, malware analysts can use Random Forest and Decision Tree with t-SNE and persistence diagrams to achieve better performance and robustness on noisy data.
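As a rough illustration of the kind of pipeline compared above (not the authors' implementation), malware feature vectors could be embedded with t-SNE and classified with a Random Forest; the synthetic data below stands in for real malware features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for malware/benign feature vectors (e.g., API-call or byte statistics).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 64)), rng.normal(0.5, 1, (200, 64))])
y = np.array([0] * 200 + [1] * 200)

# t-SNE is non-parametric, so it is fitted on the full feature matrix before splitting;
# a deployable detector would need a parametric reducer or precomputed TDA features instead.
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(X_embedded, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```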
The aircraft industry is constantly striving for more efficient design optimization methods in terms of human effort, computation time, and resource consumption. Hybrid surrogate optimization maintains high effectiveness while providing fast design evaluations, provided that both the surrogate model and the switching mechanism for eventually transitioning to the high-fidelity (HF) model are properly calibrated. Feed-forward neural networks (FNNs) can capture highly non-linear input-output mappings, yielding efficient surrogates for aircraft performance factors. However, FNNs often fail to generalize to out-of-distribution (OOD) samples, which hinders their adoption in critical aircraft design optimization. With SmOOD, our smoothness-based out-of-distribution detection approach, we propose to couple an optimized FNN surrogate with a model-dependent OOD indicator, in order to produce a trustworthy surrogate model with selective but credible predictions. Unlike conventional uncertainty-grounded methods, SmOOD exploits the inherent smoothness properties of HF simulations to effectively expose OOD samples by revealing their suspicious sensitivities, thereby avoiding over-confident uncertainty estimates on OOD samples. Using SmOOD, only high-risk OOD inputs are forwarded to the HF model for re-evaluation, leading to more accurate results at a low overhead cost. Three aircraft performance models are investigated. Results show that FNN-based surrogates outperform their Gaussian Process counterparts in terms of predictive performance. Moreover, SmOOD covers 85% of the actual OOD cases across all studied cases. When SmOOD-plus-FNN surrogates are deployed in hybrid surrogate optimization settings, they reduce the error rate by 34.65% and speed up computation by a factor of 58.36.
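A heavily simplified sketch of a sensitivity-based OOD indicator in the spirit described above (the surrogate, perturbation scale, and threshold are assumptions, not SmOOD's actual formulation): perturb a candidate input slightly, measure how much the surrogate's prediction changes, and route high-sensitivity inputs to the high-fidelity model.

```python
import numpy as np

def local_sensitivity(surrogate, x, eps=1e-2, n_probes=16, rng=None):
    """Average output change of the surrogate under small random input perturbations."""
    if rng is None:
        rng = np.random.default_rng(0)
    base = surrogate(x)
    deltas = eps * rng.standard_normal((n_probes, x.shape[-1]))
    return float(np.mean([abs(surrogate(x + d) - base) for d in deltas]))

def evaluate(x, surrogate, hf_model, threshold):
    """Trust the FNN surrogate for smooth (in-distribution) inputs, else fall back to HF."""
    if local_sensitivity(surrogate, x) > threshold:
        return hf_model(x)   # high-risk input: re-evaluate with the high-fidelity model
    return surrogate(x)      # low sensitivity: accept the cheap surrogate prediction

# Toy usage with stand-in models: a smooth surrogate and an "exact" HF function.
surrogate = lambda x: float(np.sin(x).sum())
hf_model = lambda x: float(np.sin(x).sum() + 0.01)
print(evaluate(np.zeros(4), surrogate, hf_model, threshold=0.5))
```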
In the context of aircraft system performance assessment, deep learning techniques enable engineers to quickly infer models from experimental measurements, with less detailed system knowledge than is typically required by physics-based modeling. However, this inexpensive model development also brings new challenges regarding model trustworthiness. This work presents a novel approach, physics-guided adversarial machine learning (ML), that improves confidence in the physical consistency of a model. The approach first performs a physics-guided adversarial testing phase to search for test inputs that reveal behavioral inconsistencies of the system, while still falling within the range of foreseeable operating conditions. It then carries out physics-informed adversarial training to teach the model the physical domain knowledge relevant to the system, by iteratively reducing the unwanted output deviations on the previously uncovered counterexamples. An empirical evaluation on two aircraft system performance models shows the effectiveness of our adversarial ML approach in exposing physical inconsistencies of both models and in improving their propensity to be consistent with physical domain knowledge.
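A toy sketch of the adversarial testing phase (the monotonicity property, model, and bounds are illustrative assumptions, not the paper's setup): search within the operating envelope for inputs where the learned model violates a known physical property.

```python
import numpy as np

# Illustrative physical expectation: for a fixed setting, the output should not
# decrease when the first input (e.g., a flow rate) increases.
def violates_monotonicity(model, x, step=0.05):
    x_up = x.copy()
    x_up[0] += step
    return model(x_up) < model(x)

def adversarial_search(model, bounds, n_restarts=200, rng=None):
    """Random-restart search for in-envelope inputs that break the physical property."""
    if rng is None:
        rng = np.random.default_rng(0)
    low, high = bounds
    counterexamples = []
    for _ in range(n_restarts):
        x = rng.uniform(low, high)
        if violates_monotonicity(model, x):
            counterexamples.append(x)
    return counterexamples

# Toy learned model with a small non-physical dip; the search should expose it.
model = lambda x: x[0] - 0.3 * np.sin(10 * x[0]) + 0.1 * x[1]
bad = adversarial_search(model, bounds=(np.zeros(2), np.ones(2)))
print(f"Found {len(bad)} counterexamples")
```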
The popularity of automated machine learning (AutoML) tools has increased over the past few years. Machine learning (ML) practitioners use AutoML tools to automate and optimize the process of feature engineering, model training, and hyper-parameter optimization. Recent work has qualitatively studied practitioners' experiences with AutoML tools and compared different AutoML tools based on their performance and the features they provide, but none of the existing work has studied the practice of using AutoML tools in large-scale, real-world projects. Therefore, we conducted an empirical study to understand how ML practitioners use AutoML tools in their projects. To this end, we examined the ten most frequently used AutoML tools and their respective usage in a large set of open-source project repositories hosted on GitHub. The results of our study show 1) which AutoML tools are mainly used by ML practitioners, and 2) the characteristics of the repositories that use these AutoML tools. In addition, we identified the purposes for which AutoML tools are used (e.g., model parameter sampling, search space management, model evaluation/error analysis, data/feature transformation, and data labeling) and the stages of the ML pipeline (e.g., feature engineering) at which the tools are used. Finally, we report how frequently AutoML tools are used together in the same source code file. We hope that our results help ML practitioners understand the different AutoML tools and their usage, so that they can choose the right tool for their purposes. Furthermore, AutoML tool developers can benefit from our findings to gain insight into how their tools are used and to improve them to better fit their users' usage and needs.
Software testing activities aim to find possible defects in a software product and to ensure that the product meets its expected requirements. Some software testing approaches lack automation or are only partially automated, which increases the testing time and the overall cost of software testing. Recently, reinforcement learning (RL) has been successfully applied to complex testing tasks such as game testing, regression testing, and test case prioritization, in order to automate the process and provide continuous adaptation. Practitioners can apply RL either by implementing RL algorithms from scratch or by using RL frameworks. Developers have widely used these frameworks to solve problems in various domains, including software testing. However, to the best of our knowledge, no study has empirically evaluated the effectiveness and performance of the pre-implemented algorithms in RL frameworks. In this paper, we empirically investigate the application of carefully selected RL algorithms to two important software testing tasks: test case prioritization in the context of continuous integration (CI), and game testing. For the game testing task, we conduct experiments on a simple game and use RL algorithms to explore the game in order to detect bugs. The results show that some of the selected RL frameworks, such as Tensorforce, outperform state-of-the-art approaches from the literature. For test case prioritization, we run experiments on a CI environment in which RL algorithms from different frameworks are used to rank the test cases. Our results show that, in some cases, the performance differences among the pre-implemented algorithms are considerable, motivating further investigation. Moreover, researchers wishing to select an RL framework are advised to perform an empirical evaluation on a few benchmark problems, to make sure that the RL algorithms perform as expected.
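As a minimal illustration of RL-based test case prioritization (a generic bandit-style agent, not a reproduction of any evaluated framework's algorithm), an agent could learn a value per test case from observed verdicts and rank failure-prone tests first; the reward scheme below is a simplifying assumption:

```python
import random

class PrioritizationAgent:
    """Bandit-style agent: learn a value per test case from observed CI verdicts."""
    def __init__(self, test_ids, alpha=0.3, epsilon=0.1):
        self.values = {t: 0.0 for t in test_ids}
        self.alpha, self.epsilon = alpha, epsilon

    def prioritize(self):
        # Epsilon-greedy ranking: mostly by learned value, occasionally shuffled to explore.
        order = sorted(self.values, key=self.values.get, reverse=True)
        if random.random() < self.epsilon:
            random.shuffle(order)
        return order

    def update(self, verdicts):
        # Reward 1 when a test case failed (it revealed a fault), 0 otherwise.
        for test_id, failed in verdicts.items():
            reward = 1.0 if failed else 0.0
            self.values[test_id] += self.alpha * (reward - self.values[test_id])

# Simulated CI cycles where test "t2" reveals faults far more often than the others.
agent = PrioritizationAgent(["t0", "t1", "t2", "t3"])
for _ in range(100):
    agent.update({t: (random.random() < (0.6 if t == "t2" else 0.05))
                  for t in agent.prioritize()})
print(agent.prioritize())  # "t2" should usually be ranked first
```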
Background: There is an increasing demand across various domains to leverage machine learning (ML) for solving complex problems. ML models are implemented as software components and deployed in machine learning software systems (MLSSs). Problem: There is a strong need to ensure the serving quality of MLSSs. Wrong or poor decisions made by such systems can lead to the malfunction of other systems, significant financial losses, or even threats to human life. Quality assurance of MLSSs is considered a challenging task and is currently a hot research topic. Moreover, it is important to cover all the various aspects of quality in MLSSs. Objective: This paper aims to investigate the characteristics of real quality issues in MLSSs from the perspective of practitioners. This empirical study aims to identify a catalog of bad practices related to poor quality in MLSSs. Method: We plan to conduct a set of interviews with practitioners/experts, believing that interviews are the best method to retrieve their experience and practices when dealing with quality issues. We expect that the catalog of issues developed in this step will also help us later to identify the severity, root causes, and possible remedies of quality issues in MLSSs, allowing us to develop effective quality assurance tools for ML models and MLSSs.
Context: Mutation Testing (MT) is an important tool in traditional Software Engineering (SE) white-box testing. It aims to artificially inject faults into a system to evaluate a test suite's ability to detect them, under the assumption that the suite's ability to find these artificial defects translates to its ability to find real faults. While MT has long been used in SE, it is only recently that it has started to attract the attention of the Deep Learning (DL) community, with researchers adapting it to improve the testability of DL models and increase the trustworthiness of DL systems. Objective: While several MT techniques have been proposed for DL, most of them neglect the stochasticity inherent to DL that results from the training phase. Even the most recent MT approaches for DL, which propose to tackle MT through a statistical approach, can yield inconsistent results. Indeed, because their statistics are based on a fixed set of sampled training instances, they can lead to different results across settings in cases that should be consistent in all circumstances. Method: In this work, we propose a Probabilistic Mutation Testing (PMT) approach that alleviates the inconsistency problem and allows for a more consistent decision on whether a mutant is killed or not. Results: We show that PMT effectively allows for more consistent and informed decisions on mutations, evaluating it against previously proposed MT approaches using three models and eight mutation operators. We also analyze the trade-off between the approximation error and the cost of our approach, showing that relatively small errors can be achieved at a manageable cost. Conclusion: Our results show the limitations of current MT practices for DNNs and the need to rethink them. We believe PMT is a first step in that direction, effectively removing the test-execution inconsistency of previous methods caused by the stochasticity of DNN training.
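A simplified sketch of the kind of probabilistic kill decision argued for above (the bootstrap scheme and confidence threshold are assumptions, not the paper's exact formulation): given accuracy samples from repeated trainings of the original and mutated models, kill the mutant only if the probability of an accuracy drop is high enough.

```python
import numpy as np

def kill_probability(original_accs, mutant_accs, n_boot=10000, rng=None):
    """Bootstrap estimate of P(mean accuracy drops under the mutation)."""
    if rng is None:
        rng = np.random.default_rng(0)
    orig = np.asarray(original_accs, dtype=float)
    mut = np.asarray(mutant_accs, dtype=float)
    drops = 0
    for _ in range(n_boot):
        o = rng.choice(orig, size=len(orig), replace=True).mean()
        m = rng.choice(mut, size=len(mut), replace=True).mean()
        drops += m < o
    return drops / n_boot

def is_killed(original_accs, mutant_accs, confidence=0.95):
    return kill_probability(original_accs, mutant_accs) >= confidence

# Accuracies from (hypothetical) repeated trainings of the original and mutated DNN.
original = [0.91, 0.90, 0.92, 0.89, 0.91]
mutant   = [0.86, 0.88, 0.85, 0.87, 0.86]
print(is_killed(original, mutant))  # True: the drop is consistent across training runs
```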